A Span Extraction Approach for Information Extraction on Visually-Rich Documents
Authors
Abstract
Information extraction (IE) for visually-rich documents (VRDs) has recently achieved SOTA performance thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of the model on VRDs. Firstly, we introduce a query-based IE model that employs span extraction instead of the common sequence labeling approach. Secondly, to extend this formulation, we propose a training task focusing on modelling the relationships among semantic entities within a document. This task enables target spans to be extracted recursively and can be used to pre-train the model or as an IE downstream task. Evaluation on three datasets of popular business documents (invoices, receipts) shows that our proposed method achieves significant improvements compared to existing models. The method also provides a mechanism for knowledge accumulation from multiple tasks.
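To make the contrast with sequence labeling concrete, here is a minimal sketch of how a span can be decoded from per-token start and end scores, as is common in extractive question answering. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_span`, the `max_len` parameter, the tokens, and the scores are all hypothetical, and in practice the scores would come from a model conditioned on a query such as a field name.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Return inclusive (start, end) token indices maximizing
    start_scores[start] + end_scores[end], with start <= end and
    a span length of at most max_len tokens.

    Hypothetical helper for illustration only.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within max_len.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up tokens and scores for a query like "invoice number".
tokens = ["Invoice", "No", ":", "INV", "-", "0042", "Total", ":", "118.50"]
start = [0.1, 0.0, 0.0, 2.5, 0.1, 0.2, 0.3, 0.0, 0.4]
end = [0.0, 0.1, 0.0, 0.3, 0.2, 2.8, 0.1, 0.0, 0.5]

s, e = best_span(start, end)
answer = tokens[s:e + 1]
print(answer)
```

Unlike sequence labeling, which assigns one tag to every token, this formulation returns a single contiguous span per query, so the same decoder can be reused for different fields by changing the query.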
Similar resources
Towards a Semantic Information Extraction Approach from Unstructured Documents
Recognizing and extracting meaningful information from semi- and unstructured documents, taking into account their semantics, and storing them into database is an important problem in the context of information access and retrieval. This paper describes a novel logic-based approach to information extraction from both semi- and unstructured documents. The approach, implemented in the HıLεX system, i...
A Knowledge-Based information Extraction Prototype for Data-Rich Documents in the Information Technology Domain
The Internet is a generous source of information. Semi-structured text documents represent a great part of that information; commercial data-sheets of the Information Technology domain are among them (e.g. laptop computer datasheets). However, our capacity to automatically gather and manipulate such information is limited due to the fact that those documents are designed to be read by people. Many...
Information Extraction Strategies for Thai Documents
The development of an information extraction (IE) system for Thai documents raises a number of issues which are not important for IE in English and other European languages. We describe the characteristics of written Thai and the problem statements, and our approach to the Thai IE system. The structure of written Thai is highly ambiguous, which requires more sophisticated techniques than are ne...
Few-exemplar Information Extraction for Business Documents
The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in the application domain of document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work pretty well, they still require a high effort of on-site configuration done by domain experts ...
Information extraction for semi-structured documents
The number of unstructured or semi-structured documents produced in all types of organizations continues to increase rapidly. Cost-effective ways of finding the relevant ones and extracting useful information from them are increasingly important to a large number of enterprises for operational and decision-support applications. The approach discussed in this paper constitutes a suitable basis f...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-86159-9_25